Graph autoencoder with mirror temporal convolutional networks for traffic anomaly detection

Traffic time series anomaly detection has been intensively studied for years because of its potential applications in intelligent transportation. However, classical traffic anomaly detection methods often overlook the evolving dynamic associations between road network nodes, which makes it difficult to capture long-term temporal correlations, spatial characteristics, and abnormal node behaviors in datasets with strong periodicity and trends, such as morning peak travel periods. In this paper, we propose a mirror temporal graph autoencoder (MTGAE) framework to detect anomalies and to capture unseen nodes and the spatiotemporal correlations between nodes in the traffic network. Specifically, we propose the mirror temporal convolutional module (MTCM) to enhance feature extraction and capture hidden node-to-node features in the traffic network. Moreover, we propose the graph convolutional gated recurrent unit cell (GCGRU CELL) module. This module uses Gaussian kernel functions to map data into a high-dimensional space and, based on prior knowledge and input data, identifies anomalous information and potential anomalies within the complex interdependencies of the traffic network. We compared our work with several other advanced deep-learning anomaly detection models. Experimental results on the NYC dataset show that our model outperforms the compared models for traffic anomaly detection.


Graph convolution networks
Recently, Graph Neural Network (GNN) variants, such as Graph Convolutional Networks (GCN), have demonstrated ground-breaking performance on many deep-learning tasks. In addition, they are modular, scalable, and stronger in generalization ability, and they yield insights that direct further research [14]. GCN captures the complex dependencies of node embeddings by passing information across vertices [15]. Because of these features, GCN variants for intelligent transportation treat the sensors on the roads of the traffic network as nodes and regard each node's traffic speed or flow rate as a dynamic input feature. Among them, the graph attention network (GAT) updates the node features through a pairwise function between the nodes with learnable weights [16]. However, it only computes a restricted form of static attention. To address this limitation, GATv2 [17] introduces dynamic attention alongside static attention, allowing for more adaptive computation of graph attention. In the subsequent development of the GCN, CorrSTN [18] effectively incorporates correlation information into the spatial structure. PDFormer [19] captures both short-range and long-range spatial dependencies by using various graph masks, which enables the learning of dynamic urban traffic patterns and overcomes the restriction of modeling spatial dependencies statically. Moreover, STAEformer [20] takes into account the intrinsic spatial-temporal relationships and temporal ordering information in traffic time series. These methods are widely used in traffic forecasting, while graph embedding for traffic anomaly detection is less studied. For example, ST-Decompn uses decomposition to handle the problems caused by changes in location and time in urban traffic, as well as anomalies that may manifest differently across datasets [21]. ConGAE detects traffic anomalies using a semi-supervised autoencoder framework, but only for origin-destination (OD) datasets, where it targets data sparsity and high dimensionality [13]. In addition, the Graph Convolutional Adversarial Network (STGAN) uses adversarial training with three modules that capture different features: the recent module for local features, the trend module for long-term features, and the external module for other traffic dynamics and anomalies. However, unsupervised learning with an adversarial network brings instability to anomaly detection [3]. Influenced by this state of the art, we adopt the graph convolutional gated recurrent unit (GCGRU) [22] to address the spatiotemporal characteristics of traffic anomalies. Our work focuses on the traffic anomaly prediction capabilities of GCN.
Graph autoencoders (GAEs) are an unsupervised learning method: they map nodes to a latent vector space through an encoding process and then reconstruct graph information from the vectors to generate a graph similar to the original one (decoding) [15,23]. For example, ADN [24] is a graph autoencoder structure that achieves information diffusion through alternating spatial and temporal self-attention. Owing to their power [25], GAEs are widely used in different research directions, such as link prediction [26-30], graph clustering [31,32], and hyperspectral anomaly detection [33]. While the traditional GCN takes node features and an adjacency matrix as input and produces node embeddings as output, GAEs compress the node embeddings of all nodes in a graph into a single graph embedding to obtain information about the context.

Temporal convolutional networks
Earlier research methods focus on traffic-related problems but have shown significant inaccuracies in anomaly prediction. In recent years, deep learning has gradually come to dominate time series prediction tasks thanks to its sophisticated data modeling and autonomous learning abilities. Most studies in the field of transportation rely on gated linear units (GLU) [34] or gated recurrent units (GRU) [35] to capture the dynamic temporal correlation of time series data. Moreover, based on the transformer architecture, STGM [36] introduces a novel attention mechanism to capture both long-term and short-term temporal dependencies. Temporal convolutional networks (TCNs) also have significant advantages in addressing temporal dependencies, especially in time series prediction tasks. However, most traffic flow anomaly prediction frameworks use the original TCN [37,38] structure without modification, and traffic anomaly detection is still under-explored. In this study, we enhance the TCN to better detect anomalies within this domain, allowing for a more comprehensive analysis of time series data.

Autoencoder-based anomaly detection
The autoencoder, an unsupervised neural network, has seen significant success across various fields. This success is largely due to its ability to discriminate between abnormal and regular inputs, which makes it widely used in anomaly detection [39-44]. In the field of graph convolutional networks (GCN), GCN-based autoencoders are also employed for anomaly detection [45-48]. They are mainly studied in graph embedding, which is consistent with the direction of our work: the network structure of a graph can connect the various points of an intricate real-world system for anomaly detection.

Methodology
Although many traffic anomaly detection methods have achieved strong performance, they often overlook the hidden relationships between nodes during the detection process. For instance, upstream traffic congestion during peak periods can impact downstream traffic. This oversight leaves many models unable to capture long-term temporal correlations, spatial characteristics, and highly periodic trends. To address this, we aim to identify abnormal information and potential anomalies in the complex interdependencies among nodes in traffic networks. Consequently, we propose a traffic anomaly detection framework with node interaction, MTGAE (see Fig. 2).
MTGAE consists of two main modules: MTCM and GCGRU CELL. The original input is first passed through an adaptive process, which lets our model adapt to existing datasets by converting graph signals in a low-dimensional space into latent vectors in a high-dimensional space. We then construct the MTCM and the GCGRU CELL. Specifically, we built the MTCM to expand the hidden information in space and time. Internally, the MTCM expands $x$ to the latent variables $x_m$ by a mirror flip, increases the dilation factors, and, in combination with the TCN, generates the hidden states $H$ to capture complex long-term spatiotemporal dependencies. Meanwhile, we built the GCGRU CELL module to capture long-term and short-term dependent anomalies in the traffic network. It combines the original inputs and the hidden spatiotemporal states $H$ as prior information. We first redistribute this information through the Gaussian kernel module without changing the overall structure of the traffic network (see Fig. 2), and then combine it with our GCN modules to extract more spatiotemporal information. Subsequently, based on the output of the first GCGRU CELL (the spatiotemporal information $h_1^{(t)}$) and the MTCM's hidden information $H$, the second GCGRU CELL adds more hidden details to correct the defects generated. Finally, we link the reconstructed results with the loss function to determine whether there are anomalies. In this section, we introduce the details of the MTGAE.

Problem definition
In this paper, traffic anomalies are monitored and detected in a discrete time series $T = (t_1, t_2, \ldots, t_n)$. We denote the graph by $G(T) = (V, E, W)$, where $V$ is the set of nodes (such as two nodes $v_i$ and $v_j$), $E$ denotes the set of edges between nodes, and $W$ is the weighted adjacency matrix. A larger weight between two nodes means they are closer in the road network, and vice versa (see Fig. 1). Given $G(T) = (V, E, W)$, we aim to find the abnormal event $t_a \in T$ in the graph $G$ that disrupts regular traffic operation.
We obtain the hidden state through a specially designed contextual encoder, encode the information as a low-dimensional embedding, and then decode it to reconstruct the weighted adjacency matrix, minimizing the average reconstruction error. Note that our model is trained only on data representing normal traffic conditions. Consequently, when an anomaly occurs in the traffic operation, it deviates significantly from this 'normal' baseline. This deviation is captured as a high reconstruction error by our model, indicating the presence of an anomaly.

Encoder
Our encoder comprises three steps: the adaptive process, the mirror temporal convolutional module (MTCM), and the graph convolutional gated recurrent unit cell (GCGRU CELL). Initially, the original data, denoted as $x$, passes through the adaptive process, and the MTCM is constructed to capture the states among the road network nodes that evolve over time and are not directly visible in space and time. In the GCGRU CELL, based on prior knowledge of the hidden states $H$ from the MTCM, our GCN layer, through the Gaussian kernel module, explores potential anomalies in the complex interdependencies between nodes. The encoder is trained to learn the 24 hours of a day and the 7 days of a week, facilitating interaction between the GCGRU CELL and a fully connected layer (refer to Fig. 2). Finally, the encoder outputs the graph embedding $h_G(t)$.

Mirror temporal convolutional module (MTCM)
Inspired by the TCN [11] (see Fig. 3a), we propose an improved module named MTCM that is widely applicable in traffic prediction. Although TCNs can use dilated convolution to expand the receptive field, they are weaker than advanced networks (e.g., Transformer) that can use correlation information of arbitrary length. Moreover, TCNs need strong adaptability to different historical information, which may otherwise lead to uneven predictive power and receptive fields. To overcome these limitations, we apply an adaptive step before passing the traffic network features to the TCN, reducing the effect of fluctuations in historical information on the TCN. We then perform a mirror flip to further preserve the features and capture the complex hidden relationships and dependencies between nodes in the traffic network, which exposes potential associations between nodes. Furthermore, thanks to the one-dimensional convolution of the TCN, we can keep the output sequence consistent with the original input in length. Finally, this output sequence is passed on as the hidden state $H$. More formally, consider a 1-D sequence input $x = \{x_1, x_2, x_3, \ldots, x_i\} \in \mathbb{R}^i$ and a filter $f: \{0, \ldots, j-1\} \to \mathbb{R}$, where $k$ is the kernel size (the kernel size in Fig. 3 is 2) and $d$ is the dilation factor (see Fig. 3a). In the dilated convolution operation on an element of the sequence, $x_m$ is the mirror-flipped sequence input and $\oplus$ denotes concatenation. The mirror flip further enlarges the receptive field and prevents more historical data from being lost during dilated convolution.
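The mirror flip and dilated convolution described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the concatenation order (original first, mirrored copy second) and the choice to keep the trailing outputs when restoring the original length are assumptions made only for illustration.

```python
import numpy as np

def mirror_extend(x):
    """Concatenate a sequence with its time-reversed copy (x ⊕ x_m)."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, x[::-1]])

def dilated_causal_conv(x, f, d):
    """y[s] = sum_j f[j] * x[s - d*j]; inputs before the start count as zero."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for s in range(len(x)):
        for j, w in enumerate(f):
            idx = s - d * j
            if idx >= 0:
                y[s] += w * x[idx]
    return y

def mirror_tcn_layer(x, f, d):
    """Convolve the mirrored sequence, then crop to the original length.

    Assumption: the last len(x) outputs are kept, so every retained output
    has seen both the original sequence tail and part of its mirrored copy.
    """
    n = len(x)
    return dilated_causal_conv(mirror_extend(x), f, d)[-n:]

# Kernel size 2 (as in Fig. 3) with dilation factor 2.
print(mirror_tcn_layer([1, 2, 3, 4], f=[1.0, 1.0], d=2))
```

The output has the same length as the input, which matches the statement that the output sequence stays consistent with the original input in length.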

GCGRU CELL
The GCGRU CELL mainly consists of the Gaussian kernel module and the GCN layer. We did not adopt the GRU model directly (shown in Fig. 4); instead, we constructed a GRU-inspired GCN model placed after the Gaussian kernel module. In the GCGRU CELL, we replace the original gating unit of the GRU with our GCN, which retains two important properties: the reset gate helps to capture short-term dependencies in the sequence, and the update gate helps to capture long-term dependencies in the sequence. This lets the model predict both long-term and short-term traffic network cycles. Combined with (1) Gaussian kernel processing and the prior knowledge $H$ (the hidden information of the MTCM), the GCN can capture anomaly information and possible anomalies in the complex interdependencies among nodes while predicting. For graph data, unlike image data, graph convolution is the essential operation for extracting a node's features. Figure 3b gives an example in which an origin node (orange node) takes the average value of the node features of its neighbours (white nodes in the ellipse).
(1) Gaussian kernel module To further enhance the anomaly detection capability of our module, we employ a Gaussian kernel function. It preserves the distribution characteristics of the data in high-dimensional space, which is crucial for traffic network anomaly detection. Specifically, Gaussian kernels map data from its original space to a higher-dimensional feature space where complex traffic network patterns and potential anomalies are more easily identified and processed. Moreover, the Gaussian kernel is stable: it can absorb minor fluctuations by adjusting the learned scale parameter $\sigma$ (see Eq. 2) or by using a minimax strategy [49], thereby yielding more stable anomaly detection results. In summary, embedding Gaussian kernels in the GCGRU CELL module aims to improve the model's accuracy in detecting anomalies within complex traffic networks. Experimental data show that using Gaussian kernels to alter the data distribution improves the accuracy of traffic anomaly detection (see Table 3). Building on this, we further explored the anomaly detection capabilities of the Gaussian kernel module. As depicted in Fig. 5, we examined the intermediate variables of the eight feature points generated by the 490 edges entering the GCGRU CELL. Visualizing these intermediate variables before and after integrating the Gaussian kernel module into the GCGRU CELL demonstrates the stability and the data mapping capability of our module. Throughout the experiment, the overall structure of the data remains unchanged, ensuring consistency and reliability.
Our GCGRU CELL receives two kinds of input. The first combines the original input $x$ after adaptation with the hidden information $H$ from the MTCM, so we set $H_{apt} = x + H$. The second is the output $h_1$ of the first GCGRU CELL, which also receives the hidden information $H$, so we set $H_{apt} = h_1 + H$. The Gaussian kernel module then computes $g(H_{apt})$ according to Eq. (2), where $g(H_{apt})$ is generated based on the learned scale $\sigma$ (we usually set its value between 0.5 and 1) and the $i$-th element $\sigma_i$ corresponds to the $i$-th time point. Specifically, for the $i$-th time point, its association weight to the $H_{apt}$-th point is calculated by the Gaussian kernel.
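To make the role of the per-time-point scale $\sigma_i$ concrete, the following is a minimal sketch of a Gaussian association matrix between time points. Since Eq. (2) is not reproduced in this text, the exact form below (a Gaussian of the index distance $|i - j|$, row-normalized) is an assumption, and `gaussian_association` is a hypothetical name rather than the paper's function.

```python
import numpy as np

def gaussian_association(n_points, sigma):
    """Row-normalized Gaussian weights between time points i and j.

    sigma holds one scale per time point: sigma[i] controls how quickly the
    weights of row i fall off with the distance |i - j|.
    """
    sigma = np.asarray(sigma, dtype=float).reshape(-1, 1)
    idx = np.arange(n_points)
    dist = np.abs(idx[:, None] - idx[None, :])            # |i - j|
    g = np.exp(-dist**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return g / g.sum(axis=1, keepdims=True)               # each row sums to 1

# Scale chosen inside the 0.5 to 1 range mentioned in the text.
g = gaussian_association(5, sigma=[0.5] * 5)
print(np.round(g[2], 4))
```

A smaller $\sigma_i$ concentrates the weight of row $i$ on nearby time points, while a larger one spreads it over more distant points.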
(2) GCN layer Generally, the traffic network is represented as a weighted directed graph. Traditional graph convolution networks only operate on adjacent nodes, which results in better short-term prediction than long-term prediction. Therefore, spectral graph theory is used in this paper. Let $G = (V, E, W)$ and define the spectral matrix $L = I - D^{-1/2} \hat{A} D^{-1/2}$, where $I$ is the identity matrix, $D$ is the degree matrix, and $\hat{A}$ is the adjacency matrix. To explore deeper and more complex traffic networks, we extend the graph convolution network to a higher level and divide the traffic graph $g(x)$ sent by the Gaussian kernel module into subgraphs $GA_{sub} = \{g(H_{apt_1}), g(H_{apt_2}), \ldots, g(H_{apt_n})\}$. Each subgraph considers its neighbour nodes $GA_{sub\text{-}neighbour}$, which achieves higher-order information aggregation. In the resulting graph convolution (Eq. 3), $i$ indexes the learnable weights, and $r^{(t)}$ denotes the computed results of graph convolution as time $t$ increases.
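The spectral matrix $L = I - D^{-1/2} \hat{A} D^{-1/2}$ can be computed in a few lines. The sketch below is illustrative only: it assumes a symmetric, non-negative weight matrix (a directed traffic graph would need to be symmetrized or handled with a directed Laplacian first), and `normalized_laplacian` is a hypothetical helper name.

```python
import numpy as np

def normalized_laplacian(A):
    """Return L = I - D^{-1/2} A D^{-1/2} for a non-negative adjacency matrix."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)                      # weighted node degrees
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5  # isolated nodes keep a zero entry
    return np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]

# Three road segments in a line: 0 -- 1 -- 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
L = normalized_laplacian(A)
print(np.round(L, 3))
print(np.round(np.linalg.eigvalsh(L), 3))    # eigenvalues lie in [0, 2]
```

For a symmetric graph the eigenvalues of this matrix lie between 0 and 2, which is what makes polynomial spectral filters on $L$ well behaved.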
Figure 4. The GRU architecture, which inspired us to replace the gating unit of the GRU with a GCN, giving it the same ability as the GRU to capture short-term and long-term dependencies in a sequence. The update gate $s^{(t)}$ and the reset gate $z^{(t)}$ appear in the derivation of the GCGRU equations in this paper.
In a separate aspect, the use of the GRU [49] simplifies the model, reducing complexity and enabling a faster, more effective characterization of sentence semantics. Compared to the LSTM, the GRU has fewer gating parameters, uses fewer training parameters, requires less memory, and offers faster execution and training. Owing to these advantages, our model adopts the GRU architecture over the traditional LSTM approach. We have transformed the gating unit into a graph convolution layer, as outlined in Eq. (3). This adaptation allows the graph convolution layer to play the role of the gating unit within the GRU architecture. Consequently, the GCN layer can discern more hidden states from data processed by the Gaussian kernel module, capturing the dynamic spatial correlations within the traffic network and identifying previously unseen network connections. Formally, in the gating equations, $h_{t-1}$ is the previous memory state, $W_z$, $W_s$ and $U_z$, $U_s$, $U$ are the weight parameters, $r^{(t)}$ is the current feature input, and $\sigma$ is a sigmoid activation function. We combine GCN and GRU to capture the long-term dependencies between nodes in the graph.
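As a rough illustration of how a graph convolution output $r^{(t)}$ can drive GRU-style gates, here is a single-step sketch in NumPy. The gating equations are not reproduced in this text, so the standard GRU form is assumed; the graph convolution weight `Theta` and the candidate-state weight `W_h` are hypothetical names that do not appear in the paper, and `gcgru_step` is not the authors' code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gcgru_step(A_hat, x, h_prev, p):
    """One recurrent step with a graph convolution feeding GRU-style gates.

    A_hat: (N, N) normalized adjacency, x: (N, F) node features,
    h_prev: (N, H) previous state, p: dict of weight matrices.
    """
    r = A_hat @ x @ p["Theta"]                       # graph convolution output r(t)
    z = sigmoid(r @ p["W_z"] + h_prev @ p["U_z"])    # reset gate z(t)
    s = sigmoid(r @ p["W_s"] + h_prev @ p["U_s"])    # update gate s(t)
    h_tilde = np.tanh(r @ p["W_h"] + (z * h_prev) @ p["U"])  # candidate state
    return s * h_prev + (1.0 - s) * h_tilde          # blend old and new state

rng = np.random.default_rng(0)
N, F, H = 4, 3, 5                                    # nodes, features, hidden size
params = {"Theta": rng.normal(size=(F, H))}
for name in ("W_z", "U_z", "W_s", "U_s", "W_h", "U"):
    params[name] = rng.normal(size=(H, H))
h = gcgru_step(np.eye(N), rng.normal(size=(N, F)), np.zeros((N, H)), params)
print(h.shape)
```

The update gate decides how much of the previous state is carried forward (long-term dependencies), while the reset gate controls how much of it enters the candidate state (short-term dependencies).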
(3) Graph Embedding (GE) We construct a time embedding (referred to as the GE module in Fig. 2) after the second GCGRU CELL to capture the intricate weekly and hourly periodicity inherent in the mobility data. The time embedding consists of two components: $h_{hour} \in \mathbb{R}^{day}$ represents the time-of-day embedding, and $h_{day} \in \mathbb{R}^{week}$ represents the day-of-week embedding. For example, at a specific time $t$ (e.g., 13:00 on Saturday, July 30), we use $h_{hour}$ (i.e., 13:00) and $h_{day}$ (i.e., Saturday) as the time embeddings. These embeddings incorporate additional temporal information as context for the conditioned encoder and decoder.

By incorporating these temporal factors into the graph embedding $h_G(t)$, the model can accurately capture and represent the patterns and variations in mobility data associated with different times and days. Here, $h_G(t)$ is the graph embedding at time $t$, $\hat{U}$ comes from Eq. (4), and $U_G$ is a weight matrix.
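The hour-of-day and day-of-week lookup can be sketched as two embedding tables indexed by the timestamp. This is only an illustration: the tables below are random stand-ins for learned parameters, the embedding size of 4 is arbitrary, simple concatenation is an assumed way of combining the pieces, and the year 2022 is chosen only because July 30 falls on a Saturday in that year (the text does not give a year).

```python
import numpy as np
from datetime import datetime

rng = np.random.default_rng(0)
HOUR_TABLE = rng.normal(size=(24, 4))   # one row per hour of the day
DAY_TABLE = rng.normal(size=(7, 4))     # one row per day of the week, Monday = 0

def time_embedding(ts):
    """Return (h_hour, h_day) for a timestamp."""
    return HOUR_TABLE[ts.hour], DAY_TABLE[ts.weekday()]

def conditioned_embedding(h_graph, ts):
    """Append the time embeddings to a graph-level embedding (assumed: concatenation)."""
    h_hour, h_day = time_embedding(ts)
    return np.concatenate([h_graph, h_hour, h_day])

ts = datetime(2022, 7, 30, 13, 0)       # 13:00 on a Saturday
h = conditioned_embedding(np.zeros(3), ts)
print(ts.strftime("%A"), ts.hour, h.shape)
```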

Decoder
In the decoder, we begin by extracting the node embedding information from the graph embedding $h_G(t)$. For each pair of node embeddings $(v_i, v_j)$, we embed the time information $h_{hour}(t)$, $h_{day}(t)$ into the information of each pair of nodes and compute the corresponding weight $w_{ij}$ in the weighted adjacency matrix. We then combine these node embeddings and time embeddings to form a graph embedding $h'_G(t)$ that varies over time $t$. It contains both the node information and the time information (that is, the embedding includes the collective features of all nodes in the graph at moment $t$). Subsequently, a fully connected layer processes this graph embedding to recover useful vector representations from it. After processing by the fully connected layer, the vectors for $i$ and $j$, corresponding to $h_i(t)$ and $h_j(t)$, are unstacked to recover the embedding of each individual node at time $t$. The outcome of this process is the embedding representation $h_n(t)$ of a particular node $n$ at a specific time $t$. Finally, to obtain the reconstructed edge weights, we first apply the ReLU activation function to the graph embeddings, resulting in a feature vector $\hat{W}_{ij}(t)$ that has undergone a nonlinear transformation. The reconstructed edge weights $W^R_{ij}(t)$ are then obtained from the feature vector $\hat{W}_{ij}(t)$ and the Sigmoid function.
The bilinear module in the decoder is significant. It applies a transformation to the incoming data with two main benefits: 1) It ensures that edge weight predictions consider directionality. In a directed graph, the edge weight from node $i$ to node $j$ can differ from the weight from node $j$ to node $i$. 2) It computes the edge weights with the formula $w_{ij} = h_i(t)^{\top} A\, h_j(t)$, where $A$ is a learned parameter. This approach enables the model to distinguish edge weights based on direction, depicting directed graph relationships more accurately.
Here, $dec_G$ is a weight matrix, $dec_1$ is the weight matrix of the feature vector $\hat{W}_{ij}(t)$, and $dec_2$ is the weight matrix of $W^R_{ij}(t)$. The Sigmoid ensures that the output $W^R_{ij}(t) \in [0, 1]$.
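The directionality argument for the bilinear form can be checked numerically. The sketch below computes $h_i^{\top} A h_j$ for all node pairs and squashes the result with a sigmoid; it omits the fully connected and ReLU stages of the decoder, and `bilinear_edge_weights` is a hypothetical name, so it should be read as an illustration of the bilinear step only.

```python
import numpy as np

def bilinear_edge_weights(H, A):
    """W[i, j] = sigmoid(h_i^T A h_j) for node embeddings H of shape (N, d)."""
    scores = H @ A @ H.T
    return 1.0 / (1.0 + np.exp(-scores))

H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])            # non-symmetric, so direction matters

W = bilinear_edge_weights(H, A)
print(np.round(W, 3))
print(W[0, 1] != W[1, 0])             # weight i -> j differs from j -> i
```

With a symmetric $A$ the reconstructed matrix would also be symmetric, so the asymmetry of the learned parameter is what lets the decoder represent directed edges.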

Loss function
We use the mean squared error (MSE) as the loss function, a measure of the difference between the actual value $y$ and the predicted value $\hat{y}$, to evaluate our model. Formally, $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, where $i$ indexes the points in the sequence and $n$ is their number. The reconstructed weights are $W^R_{ij}(t)$ and the actual weights are $W_{ij}$. During testing, the loss function Eq. (7) for each testing instance is used as its anomaly score.
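Using the reconstruction loss as an anomaly score can be sketched as follows. The helper name `anomaly_score` and the toy matrices are illustrative assumptions; the only idea taken from the text is that the mean squared error between the actual and reconstructed weighted adjacency matrices of a test instance serves as its score.

```python
import numpy as np

def anomaly_score(W_true, W_recon):
    """Mean squared error between actual and reconstructed edge weights."""
    diff = np.asarray(W_true, dtype=float) - np.asarray(W_recon, dtype=float)
    return float(np.mean(diff ** 2))

W_true = [[0.0, 1.0],
          [1.0, 0.0]]
good_recon = [[0.0, 1.0],
              [1.0, 0.0]]
poor_recon = [[0.0, 0.5],
              [1.0, 0.0]]

print(anomaly_score(W_true, good_recon))   # 0.0
print(anomaly_score(W_true, poor_recon))   # 0.0625
```

Because the model is trained on normal traffic only, a time slice that it reconstructs poorly receives a higher score and is ranked as more likely to be anomalous.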

Experiments

Datasets and implementation
To ensure the model's credibility, we focused on general datasets that target traffic anomaly detection in our experiments.We verify our MTGAE method on two public traffic network datasets.
• PEMS-BAY dataset: It is collected in real time from nearly 40,000 individual detectors spanning the freeway system across all major metropolitan areas of California [50]. The dataset comprises 365 sensors located in the Bay Area and contains traffic data recorded from April to May 2014. For our analysis, we selected a subgraph of six sensors, each with recorded speed and traffic flow information pertaining to our network. Furthermore, we extended the duration of each traffic incident from CHP (CHP Traffic Incident Information, https://www.chp.ca.gov/traffic) by one hour to account for the impact of traffic accidents.
• New York City (NYC) taxi dataset: The NYC taxi trips dataset is publicly released by the Taxi and Limousine Commission (TLC). It records the time and location of each taxi pick-up and drop-off, and we pool the records for each hour into a matrix. This dataset includes six months of data, from January 2019 to March 2019. Since the NYC dataset lacks anomaly labels, we used anomaly injection to add anomalies to the time series of the dataset [51,52].

Baselines
To validate our method's effectiveness in anomaly detection on the NYC dataset, we compare it with the following baselines. We obtained these methods from their official public code repositories and employed their optimal experimental setups, running all models on the NYC dataset to guarantee fairness:
• Con-GAE [13]: The method was developed to tackle the challenges posed by extreme data sparsity and high dimensionality, specifically to address anomalies in traffic conditions. It uses context-enhanced graph autoencoders to improve the effectiveness of anomaly detection.
• SuperGAT [53]: A self-supervised graph attention network that uses edge information to guide attention learning. SuperGAT analyzes two common attention forms, revealing their limitations in capturing label agreement and edge presence, and proposes enhanced attention mechanisms tailored to graph characteristics.
• EG [54]: The Efficient Graph Convolution (EGC) method is an isotropic Graph Neural Network (GNN) architecture. EGC outperforms comparable anisotropic models like GAT and PNA in terms of accuracy and efficiency. This finding challenges the prevalent belief that anisotropic GNNs are inherently superior.
• GraphGPS [55]: A modular and scalable framework designed to build graph transformers, integrating message passing with global attention. This framework also categorizes positional and structural encodings, thereby injecting useful inductive biases. GraphGPS demonstrates state-of-the-art performance in various graph learning tasks and scales to thousands of nodes.
• GATv2 [17]: Graph Attention Networks (GATs) are limited by their computation of restricted "static" attention, which inhibits their ability to dynamically prioritize neighbors. To overcome this limitation, GATv2 alters the order of operations in the scoring function, enabling more expressive dynamic attention.
• Dir-GNN [56]: The method enhances message passing neural networks (MPNNs) by incorporating edge directionality and conducting distinct aggregations for incoming and outgoing edges. It significantly improves learning on heterophilic graphs, where neighboring nodes often have different labels, as well as performance on homophilic graphs, characterized by label-sharing neighbors.
• PMLP [57]: The method introduces propagational MLPs, which employ an MLP architecture for training and add message passing layers before inference. This approach bridges the gap between MLPs and GNNs, achieving performance that is comparable to or surpasses that of GNNs. It demonstrates the effectiveness of GNN architectures for generalization, even without training in a graph context. Additionally, PMLPs offer faster and more robust training than GNNs.

Scientific Reports | (2024) 14:1247 | https://doi.org/10.1038/s41598-024-51374-3

Experimental setups
Our experiments were conducted using a GPU 2080TI and an Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz.
Considering the anomaly problem, we used anomaly injection in our experiments, randomly selecting time slices θ in each sequence in which to inject anomalies. We then perturb the portion of the time series corresponding to these uniformly sampled time slices θ (e.g., swapping 10 am for 10 pm). In this experiment, we set three pollution ratios and magnitudes, α, β, and γ, on the NYC dataset. Anomalies in traffic networks are mainly divided into two types [58,59]: (1) Spatial anomalies, where the current traffic conditions are inconsistent with normal traffic conditions (for example, the vehicle flow is inconsistent with the normal flow observed in the past). (2) Temporal anomalies, where the current traffic conditions conform to the normal spatial pattern, but not to the current time. In this paper, we contaminate the dataset as follows: let γ represent the proportion of time slices randomly selected for contamination, which applies to the injection of both spatial and temporal anomalies; let α denote the proportion of origin-destination pairs selected for contamination; and let β define the range of the uniform distribution used to perturb the travel time. In effect, β defines the magnitude of spatial anomalies, i.e., the maximum possible value of the travel time perturbation.
In the experiments, we adjust the levels of the pollution ratios and magnitudes (α, β, and γ) to evaluate the effectiveness of anomaly detection under different scenarios. The specific steps are as follows. For spatial anomalies, we first randomly select a proportion γ of time slices, randomly choose a proportion α of origin-destination pairs in each contaminated time slice, and then perturb the travel time of these pairs by factors drawn from the uniform distribution U(−β, β). Temporal anomalies are created by randomly selecting a proportion γ of time slices and shifting the time in the data by 12 hours (e.g., changing 8 PM to 8 AM, and vice versa). We set α ∈ {25%, 50%}, β ∈ {5%, 10%, 20%}, and γ ∈ {5%, 10%, 20%}.
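The two injection procedures can be sketched as follows. This is an illustration under stated assumptions rather than the experimental code: the data layout (a `(T, P)` array of travel times for `T` time slices and `P` origin-destination pairs, plus an array of hour labels) and the function names are hypothetical, and the perturbation is assumed to be multiplicative, i.e. each selected travel time is scaled by `1 + u` with `u` drawn from U(−β, β).

```python
import numpy as np

def inject_spatial(travel, alpha, beta, gamma, rng):
    """Perturb a share alpha of OD pairs in a share gamma of time slices."""
    travel = np.array(travel, dtype=float)           # copy, leave the input intact
    T, P = travel.shape
    labels = np.zeros(T, dtype=bool)
    slices = rng.choice(T, size=int(round(gamma * T)), replace=False)
    for t in slices:
        pairs = rng.choice(P, size=int(round(alpha * P)), replace=False)
        u = rng.uniform(-beta, beta, size=len(pairs))
        travel[t, pairs] *= 1.0 + u                  # assumed multiplicative factor
        labels[t] = True
    return travel, labels

def inject_temporal(hours, gamma, rng):
    """Shift the hour label of a share gamma of time slices by 12 hours."""
    hours = np.array(hours, dtype=int)
    labels = np.zeros(len(hours), dtype=bool)
    n_shift = int(round(gamma * len(hours)))
    slices = rng.choice(len(hours), size=n_shift, replace=False)
    hours[slices] = (hours[slices] + 12) % 24        # e.g. 20:00 becomes 08:00
    labels[slices] = True
    return hours, labels

rng = np.random.default_rng(0)
travel = np.ones((20, 8))
spatial, spatial_labels = inject_spatial(travel, alpha=0.5, beta=0.1, gamma=0.2, rng=rng)
shifted, temporal_labels = inject_temporal(np.arange(24), gamma=0.25, rng=rng)
print(spatial_labels.sum(), temporal_labels.sum())
```

The returned boolean labels mark the contaminated time slices, which is what an AUC evaluation needs as ground truth.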
For the training process, we initially set the number of epochs to 150 and the batch size to 10. For the day-of-week and hour-of-day embeddings mentioned previously, we set both $h_{day}$ and $h_{hour}$ to 100, and we set the dimensions of the graph embedding to 150 and 50, respectively. The dropout rate was set to 0.2, and the learning rate is 0.001 by default. We also apply learning rate decay during training: at each decay step, the learning rate is multiplied by 0.1 so that the model can learn the parameters better. Finally, we selected the NYC data from January 8 to March 31, 2019, as the training set and held out 10% of it for validation. We used the NYC data from January 1 to January 7, 2019, and a portion of the Uber Movement data as the test set. Note that the sampling was based on uniform random sampling, and the training set and test set were mutually exclusive (i.e., the same data point does not appear in both the training set and the test set).
In addition, ablation experiments were performed on the PEMS dataset to verify the effectiveness of the proposed modules, evaluated using the MAE metric. MAE and RMSE are considered more credible evaluation measures in some anomaly detection settings, especially in the traffic domain [3]. Six epochs were set for training. Each epoch was divided into 128 batches, the generator loss function was 500, the learning rate was 0.001, and it decayed by a factor of 0.1 per epoch. On this dataset, we set the number of TCN layers to 9 and transformed the head nodes in the GCN to GAT to improve the model's parallelism. For learning, we set the hidden layer size to 64.

Result and analysis
(1) Comparison with state-of-the-art work Initially, we compared our MTGAE model with several baseline models using the AUC as the evaluation metric. The AUC accounts for the classifier's ability on both positive and negative cases, so it still gives a reasonable evaluation when the samples are imbalanced. We fixed α = 50% and β = 10% for the pollution magnitude and varied the anomaly rate to compare the models' ability to detect anomalies. Table 1 shows that our MTGAE is significantly better than the other models: it outperforms them by about 0.1-0.4 in AUC at different anomaly rates.
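For readers unfamiliar with the metric, the AUC can be computed directly from anomaly scores and labels as the probability that a randomly chosen anomalous instance scores higher than a randomly chosen normal one, with ties counted as one half. The sketch below uses only the standard library; the function name `auc` and the toy scores are illustrative and unrelated to the values in Table 1.

```python
def auc(scores, labels):
    """Pairwise (rank-based) AUC: P(score of positive > score of negative)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Reconstruction errors for four time slices; the first two are true anomalies.
print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))   # 0.75
```

Because the value depends only on how anomalous and normal instances are ranked against each other, it is unaffected by the proportion of anomalies, which is why it remains informative under class imbalance.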
After that, we fixed the proportion γ of polluted time slices and varied the pollution magnitude to study its effect. As shown in Table 2, we set α to 25% and 50% in turn and applied the same pollution magnitudes β under each α. Our model scores higher than the baselines, particularly at higher β. For example, the AUC of most models with α = 50% and β = 20% is above 0.9, and most of the baseline models unless combined with the Gaussian kernel function for processing, which illustrates the importance of MTCM and GCGRU CELL for anomaly detection. Notably, incorporating a mirror into the TCN significantly improves its efficacy in enhancing GCGRU CELL performance; this demonstrates a superior ability to capture both long-term and short-term memory and temporal information in time series.
(3) Real world reflects abnormal traffic We used the NYC dataset from January 1, 2019, to January 7, 2019, to test our model on a real-world traffic situation and demonstrate its effectiveness. We used the reconstruction loss to represent the likelihood of anomalies, as shown in Fig. 6. January 4 was a Friday, and the likelihood of anomalies is densely concentrated in the afternoon of this day, from which we can infer that Black Friday Shopping is prone to traffic anomalies due to traffic jams.
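The use of reconstruction loss as an anomaly indicator can be sketched as follows. This is an illustration under stated assumptions, not the procedure that produced Fig. 6: we assume a mean squared error, an array layout of (day, hour, feature), and synthetic data in which one afternoon is deliberately reconstructed poorly. The function name is hypothetical.

```python
import numpy as np

def anomaly_heatmap(observed, reconstructed):
    """Mean squared reconstruction error per (day, hour) cell.

    Both inputs have shape (days, hours, ...); all trailing axes
    are averaged out. Assumption: squared error is the loss.
    """
    observed = np.asarray(observed, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    if observed.shape != reconstructed.shape:
        raise ValueError("observed and reconstructed must have the same shape")
    err = (observed - reconstructed) ** 2
    return err.reshape(err.shape[0], err.shape[1], -1).mean(axis=2)

# Synthetic week: 7 days, 24 hours, 5 features per hour.
rng = np.random.default_rng(0)
observed = rng.random((7, 24, 5))
reconstructed = observed.copy()
reconstructed[3, 15:18] += 0.5  # day index 3, hours 15-17 reconstructed badly

heat = anomaly_heatmap(observed, reconstructed)
day, hour = np.unravel_index(np.argmax(heat), heat.shape)
```

Plotting `heat` with hours on the horizontal axis and dates on the vertical axis gives a map of the same kind as the one described for the first week of January 2019, where darker cells mark hours that the autoencoder reconstructs poorly.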
(4) Sensitivity analysis To study how MTGAE varies with the weekly, hourly, and node embeddings, we set α = 50%, β = 10%, and γ = 10%. We explored the model's effectiveness across space and time by changing the dimension of the node embedding from 25 to 200 (the dimensionality is acceptable for the first GCN and the second GCN) and the week and hour dimensions of the temporal embedding from 10 to 200 for training. As shown in Fig. 6b, the performance of our model does not change much, and the AUCs all remain between 0.9 and 1, indicating that our model works well in most settings. Moreover, we can further see that the AUC of our model is lower when the time node embedding is large than when the embedding is small.

Figure 1. Illustration of two parts of the NYC dataset graph (first column) and the corresponding weighted adjacency matrices (second column), whose entries are scaled inverse travel times between points on the graph. Note that the mobility data is modeled as a series of time-dependent directed weighted graphs. (a) A portion of the visualizable NYC mobility data. (b) The adjacency matrix corresponding to subfigure (a). (c) Another portion of the visualizable NYC mobility data that is not part of subgraph (a). (d) The adjacency matrix corresponding to subfigure (c).

Figure 2. The architecture of the MTGAE. The architecture consists of two main components: an Encoder and a Decoder. The Encoder includes an adaptive process, the MTCM, and the GCGRU CELL. The MTCM is designed to effectively capture relevant information from data of variable length. It incorporates the TCN for processing data that has undergone a 'mirroring' transformation, adjusting the length of mirrored data back to its original state prior to input. The GCGRU CELL, comprising a Gaussian Kernel and Graph Convolutional Networks (GCNs), is instrumental in mapping finite-dimensional data to a higher dimensional space. This mapping aids in anomaly detection while preserving data distribution. The GCNs within the GCGRU CELL leverage the GRU architecture to extract spatial-temporal dependencies. Lastly, the Decoder's primary function, facilitated by the Bilinear module, is to resample features derived from the Encoder's output, enhancing the overall data interpretation process.

Figure 3. (a) MTCM uses a dilated (hole) convolution kernel of size 2. On the left, x m is the mirror image feature of x ; the module uses the dilation (expansion) factor K, selects the input at every k-th step, and then applies a 1D convolution. (b) How an embedded node and its surrounding embedded nodes are connected through GCN, where the orange node is the original node, and its neighbour nodes are white and enclosed in the ellipse. (c) Our GCN layer differs from the original GCN in that it can associate more sub-nodes. (d) How our GCN layer is associated with its child nodes (blue nodes) through the example orange node, and how the child nodes (blue nodes) then spread to their own child nodes (yellow nodes).

Figure 4. The GRU architecture, which inspired us to replace the gating units of the GRU with GCN, giving the resulting cell the same ability as the GRU to capture short-term and long-term dependencies in a sequence. The s (t) update gating and z (t) reset gating are reflected in the derivation equations of GCGRU in this paper.

Figure 5. We extracted the intermediate variables before and after using the Gaussian kernel module to visually demonstrate this module's importance in our model. (a) Before the Gaussian kernel module. (b) After the Gaussian kernel module.

(5) Generalization ability To explore the generalization ability of MTGAE, we performed experiments on DGraphFin, a large-scale dynamic graph dataset in the financial domain 60 . It contains over 3.7 million nodes and 4.3 million dynamic edges. Nodes represent financial loan users, and directed edges represent emergency contact relationships. Each node has 17 feature dimensions representing different elements of personal profiles, such as age and gender. Among the nodes in the dataset, 15,509 are categorized as fraudsters, 1,210,092 as normal users, and the remaining 66.8% of nodes (2,474,949 nodes) are registered users who have not borrowed from the platform. Based on the officially published baseline and code, we input the DGraphFin data into our MTGAE, then carry out feature learning on the 17 features within the MTGAE structure, and finally classify the nodes into two categories (the other baselines also use two categories) for anomaly detection, with results shown in the

Figure 6. (a) Our model's ability to detect traffic anomalies. The horizontal axis denotes the hour, the vertical axis denotes the date, and the color depth indicates the likelihood of traffic anomalies (reconstruction loss). (b) The sensitivity experiment of the model. The AUC stays between 0.9 and 1.0 under different node embedding and time embedding dimensions.

Table 4. The AUC scores of our model and other baselines on the DGraphFin dataset.